Prediction of ovarian cancer using artificial intelligence tools

Abstract
Purpose Ovarian cancer is a common type of cancer and a leading cause of death in women. Therefore, accurate and fast prediction of ovarian tumors is crucial. One suitable approach for predicting and diagnosing this cancer is to build a model based on artificial intelligence (AI) methods. Such models provide a tool for predicting ovarian cancer according to the characteristics and conditions of each person.
Method In this study, a data set comprising 171 records of benign ovarian tumors and 178 records of ovarian cancer was analyzed. The data set contains the blood test results and tumor markers of the patients. After data preprocessing, including removing outliers and replacing missing values, the weights of the effective factors were determined using information gain and the Gini index. In the next step, predictive models were created using random forest (RF), support vector machine (SVM), decision tree (DT), and artificial neural network (ANN) methods. The performance of these models was evaluated with 10-fold cross-validation in terms of specificity, sensitivity, accuracy, and the area under the receiver operating characteristic curve. Finally, by comparing the performance of the models, the best predictive model of ovarian cancer was selected.
Results The most important predictive factors were HE4, CA125, and NEU. The RF model was identified as the best predictive model, with an accuracy of more than 86%. The predictive accuracy of the DT, SVM, and ANN models was estimated as 82.91%, 85.25%, and 79.35%, respectively.
Conclusion Various AI tools can be used with high accuracy and sensitivity in predicting ovarian cancer. Therefore, the use of these tools can help specialists and patients with early, easier, and less expensive diagnosis of ovarian cancer. Future studies can leverage AI to integrate image data with serum biomarkers, thereby facilitating the creation of novel models and advancing the diagnosis and treatment of ovarian cancer.


| INTRODUCTION
Cancer is a malignancy characterized by high aggressiveness, low survival rates, and prolonged and costly treatment procedures.[3] Ovarian cancer ranks among the most common types of cancer impacting women. Every year, over 240,000 new cases of ovarian cancer are identified, and approximately 150,000 women lose their lives to this disease. Ovarian cancer consists of a diverse group of tumors that are categorized based on distinct histopathological and molecular characteristics. Epithelial ovarian cancer (EOC) is the predominant type of ovarian cancer. It can be categorized into four main subtypes based on tumor cell appearance: serous, endometrioid, clear cell, and mucinous. The significant morbidity and mortality associated with ovarian cancer can be attributed to the late detection of the disease and the reduced effectiveness of surgical or pharmacological treatments. Ovarian cancer often presents with symptoms that appear late and are not specific, leading to up to 75% of cases being diagnosed at an advanced stage. Unfortunately, only about 20% of those diagnosed at this stage will survive for 5 years after diagnosis. 4,5 Different screening techniques such as pelvic exams, transvaginal ultrasound, CA125 cancer antigen tests, and magnetic resonance imaging (MRI) are used to identify this disease. However, none of these methods guarantees an accurate diagnosis.
For instance, pelvic examination and ultrasound have low sensitivity and specificity, while CA125 marker levels may not rise in all patients with ovarian cancer. Furthermore, accurate diagnosis through MRI requires an expert specialist, which can be challenging. [7][8] The development of predictive tools has enabled patients and medical practitioners to carry out diagnostic procedures more accurately and quickly while also enabling them to devise treatment plans that are well-suited to the specific needs of each patient. Artificial intelligence (AI) systems have gained widespread adoption as a result of their numerous benefits and can be employed to surmount the shortcomings of traditional diagnostic techniques. These systems have several advantages, such as their ability to handle large quantities of data, address instances of missing data, and adapt to new data inputs. 9,10 AI techniques have been increasingly utilized in recent times for precise diagnostic applications across diverse disease categories. In particular, machine learning (ML) and deep learning have become popular for diagnosing and predicting various diseases, especially cancer, and many studies have been published in this field. 1,7 However, only a limited number of studies have addressed the prediction of ovarian cancer using AI (ML) tools, and given the limitations of these studies, newer and more complete studies are needed. 11,12 Therefore, this study proposes the adoption of artificial intelligence-based systems as prediction tools for ovarian cancer. In this regard, data will be extracted from a data set containing the information of different patients, and AI methodologies will be employed to construct several models that can effectively predict ovarian cancer. The best-performing model will then be identified through subsequent evaluations.

| Data set
This study was conducted in 2022-2023. The data are available in the Mendeley Data repository "Using Machine Learning to Predict Ovarian Cancer" 13 located at https://data.mendeley.com/datasets/th7fztbrv9/11 for any academic, educational, or research purposes.
The data set includes the data of 349 patients with 49 characteristics as input (Table 1).

| Data analysis
The overall research steps are illustrated in Figure 1. The data analysis methodology involved the following steps:
1. Data preprocessing: RapidMiner version 9.10 was used to clean the data by replacing missing values, removing outliers, and normalizing the data. This step was crucial to ensure the accuracy of the subsequent analyses.
2. Factor weight determination: The weight of factors affecting ovarian cancer was determined using the Information Gain (IG) and Gini Index methods. These methods helped to identify the most important predictive factors for ovarian cancer.
3. Modeling: AI models were created using classification techniques such as random forest (RF), support vector machine (SVM), and decision tree (DT). The models were evaluated using 10-fold cross-validation in terms of accuracy, sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve (AUC). The best model was selected based on these indicators.
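The evaluation scheme in the modeling step can be illustrated with a minimal Python sketch. It is not the RapidMiner process used in this study: the stratified fold assignment, the pooling of held-out predictions into a single confusion matrix, the placeholder threshold classifier, and the toy values are all assumptions made for illustration, and every function name is hypothetical.

```python
import random


def stratified_k_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices round-robin
    return folds


def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity with class 1 as malignant."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Train on k-1 folds, predict the held-out fold, pool all predictions."""
    y_pred = [None] * len(y)
    for held_out in stratified_k_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            y_pred[i] = predict(model, X[i])
    return confusion_metrics(y, y_pred)


# Placeholder classifier (assumption): threshold at the midpoint of class means.
def fit_threshold(X, y):
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Toy, well-separated values (not patient data).
X = [float(v) for v in range(20)] + [float(v) for v in range(100, 120)]
y = [0] * 20 + [1] * 20
folds = stratified_k_folds(y, k=10)
metrics = cross_validate(X, y, fit_threshold, predict_threshold, k=10)
print(metrics)
```

Any classifier exposing a `fit`/`predict` pair of this shape could be substituted for the placeholder, so the same loop would serve to compare several models on identical folds.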
The implemented blocks in RapidMiner Studio are presented in Figure 2.

| DT
DT is a machine-learning method that makes decisions based on a tree-shaped graph structure. In this method, each node of the DT represents an attribute, and the tree is created based on the relationship between the attributes. Typically, a DT is formed using a

TABLE 1 Data set description.

| RF
RF is a machine-learning method that combines several DTs, each of which is trained independently using a random subset of features and data.
The main advantage of random forest is that by combining the decisions of multiple trees, it avoids relying on a single tree whose decisions may be incomplete, unreliable, and highly dependent on the training data. This method can be very useful and powerful for cases where the

| Gini index
The Gini index has demonstrated its effectiveness in identifying relevant features across various applications, including ML. Widely adopted in DT algorithms such as classification and regression tree (CART) and RF, the Gini index serves as a popular technique for feature selection. Additionally, a weighted Gini index can be used as a splitting criterion to address imbalanced data. 22,23

| RESULTS

| Data set
This data set includes 171 records of benign ovarian tumors and 178 records of ovarian cancer. Figure 3 depicts the age distribution of both groups. The ages of the patients range from 15 to 83 years, with a mean of 45 years and a standard deviation of 15.1 years.

FIGURE 2 The process designed in RapidMiner software to evaluate predictive models.

| Analysis factors affecting the differential diagnosis of ovarian cancer
The factor weights obtained for the diagnosis of ovarian cancer malignancy are shown in Figure 4 for the IG method and in Figure 5 for the Gini Index method. In both techniques, the three most important factors are HE4, CA125, and NEU.
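To make the two weighting criteria concrete, the following Python sketch computes the information gain and the Gini-based impurity reduction of a single numeric feature. It is an illustration only and not the RapidMiner weighting procedure used in this study: binarizing the feature at one fixed threshold is an assumption, the function names are hypothetical, and the values are toy numbers rather than patient measurements.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_reduction(values, labels, threshold, impurity):
    """Parent impurity minus the size-weighted impurity of the two groups
    obtained by splitting the feature at `threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = sum(len(part) / n * impurity(part) for part in (left, right) if part)
    return impurity(labels) - weighted


# Toy example: an informative feature and an uninformative one.
labels = [0, 0, 0, 0, 1, 1, 1, 1]            # 0 = benign, 1 = malignant
marker_level = [10, 20, 30, 40, 200, 300, 400, 500]
noise = [1, 5, 2, 6, 3, 7, 4, 8]

ig_marker = impurity_reduction(marker_level, labels, 100, entropy)
gini_marker = impurity_reduction(marker_level, labels, 100, gini_impurity)
ig_noise = impurity_reduction(noise, labels, 4.5, entropy)
gini_noise = impurity_reduction(noise, labels, 4.5, gini_impurity)
print(ig_marker, gini_marker, ig_noise, gini_noise)
```

Under both criteria, a feature that separates the two classes receives a high weight and a feature unrelated to the class receives a weight of zero, which is why the two rankings tend to agree on the strongest factors.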

| Evaluation of the effectiveness of predictive models for ovarian cancer
A comparison of the performance of these models based on accuracy, specificity, sensitivity, and area under the ROC curve is provided in Table 2.

HE4 is a protein synthesized by the majority of epithelial ovarian cancer cells, although not all cells produce it. An HE4 test can be employed to monitor epithelial ovarian cancer posttreatment and to detect recurrence or disease progression, but it is not recommended for screening asymptomatic women for ovarian cancer. 24,25 High levels of HE4 are commonly found in the blood of women with epithelial ovarian cancer. [27][28] Multiple studies have shown that the HE4 biomarker is not only valuable for diagnosing ovarian cancer but also for predicting prognosis and guiding therapy selection in patients. 24,28 Furthermore, CA125 is widely recognized as a crucial biomarker for monitoring epithelial ovarian cancer. As a screening indicator, CA125 is employed to identify patients with ovarian cancer within the population and to distinguish it from benign conditions. 29 CA125 has been a pivotal factor in the screening, treatment, and posttreatment monitoring stages of managing ovarian cancer. 30 [32][33] Although CA125 can be used to diagnose ovarian cancer, ML models that combine CA125 with other factors can achieve higher predictive accuracy than CA125 alone. The findings of this study align with those of similar studies. 28,34,35

FIGURE 3 Age distribution of patients in the two groups: malignant ovarian cancer and benign ovarian tumor.
Similar studies have been conducted using ML methods in the field of ovarian cancer. For example, Ma et   38 Moreover, Ahamad et al. found that RF had a high performance for early-stage detection of ovarian cancer. 39 The results of their studies align with the findings of our study.
We removed 10 outliers from our data set because they could bias the model's performance. Jin et al. also removed outliers in their study, stating that their removal can significantly increase the prediction capabilities of models. 40 [43] The current study could introduce high-performance ML models that can be utilized for predicting ovarian cancer and overcome the limitations and disadvantages of clinical methods. This study has some limitations. First, the sample size of the data set used in the study was small, which may have limited the statistical power and generalizability of the results. Second, the data set used in the study was from a single center, which may limit the generalizability of the findings to other populations or settings.

FIGURE 4 Factors affecting the diagnosis of malignant ovarian cancer obtained by the Information Gain method.
FIGURE 5 Factors affecting the diagnosis of malignant ovarian cancer obtained by the Gini Index method.
TABLE 2 Comparing the performance of ovarian cancer prediction models.

set of training data. During training, the DT algorithm identifies the optimal attribute for data splitting using metrics like entropy or Gini impurity. The objective is to maximize IG or minimize impurity after the split. DTs classify examples by traversing the tree from the root to a leaf (terminal node), and the final classification of each example is determined by the label of the leaf it reaches. The DT method has several advantages, including simplicity and high comprehensibility, the ability to check important features, the ability to use discrete and continuous input data, the ability to estimate any type of feature, and finally, the ability to check and evaluate complex conditions. The DT algorithm can be used for both classification and regression tasks. Since DTs replicate human thought processes, data scientists usually find it straightforward to comprehend and interpret the outcomes. DT algorithms possess significant capabilities in data classification and in evaluating the costs, risks, and potential benefits associated with decisions. 14,15

number of features is large and changeable. The RF algorithm constructs a forest by training DTs using bagging (bootstrap aggregating). Bagging is an ensemble meta-algorithm that enhances the accuracy of ML models. The algorithm determines the final prediction by combining the outputs of the individual trees: the mode (majority vote) for classification or the mean for regression. Increasing the number of trees generally improves the stability of the outcome. Leveraging multiple trees instead of relying on a single DT results in greater accuracy and stability.
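The bagging and voting mechanism described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the RF implementation used in this study: each "tree" is reduced to a one-level stump whose threshold is the midpoint of the class means (a stand-in for an IG- or Gini-based split search), each stump sees one randomly chosen feature, and all names and values are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """One-level tree on a single feature (simplified stand-in for a full DT)."""
    pos = [r[feature] for r, y in zip(rows, labels) if y == 1]
    neg = [r[feature] for r, y in zip(rows, labels) if y == 0]
    if not pos or not neg:
        constant = 1 if pos else 0  # bootstrap sample held a single class
        return lambda row: constant
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    if mean_pos >= mean_neg:
        return lambda row: 1 if row[feature] > threshold else 0
    return lambda row: 1 if row[feature] < threshold else 0


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Bagging: each stump is trained on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        feature = rng.randrange(n_features)            # random feature subset of size 1
        forest.append(
            fit_stump([rows[i] for i in sample], [labels[i] for i in sample], feature)
        )
    return forest


def predict_forest(forest, row):
    """Classification output: the mode (majority vote) of the individual trees."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy, well-separated data with two features (not patient data).
rows = [[float(i), float(i)] for i in range(10)] + \
       [[100.0 + i, 100.0 + i] for i in range(10)]
labels = [0] * 10 + [1] * 10
forest = fit_forest(rows, labels, n_trees=25, seed=0)
print(predict_forest(forest, [5.0, 5.0]), predict_forest(forest, [105.0, 105.0]))
```

An odd number of trees is used so that a two-class vote cannot end in a tie; for regression, the vote would be replaced by the mean of the tree outputs.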

4 | DISCUSSION
In this study, the factors influencing the differential diagnosis of ovarian cancer were investigated. Among the 49 characteristics examined, the most effective factors obtained by the IG method and the Gini index were HE4, CA125, NEU, and age. In addition, based on the available data, DT, SVM, RF, and ANN models were generated and compared in terms of accuracy, sensitivity, specificity, F-measure, and AUC; the random forest model provided the highest performance.

for ovarian cancer up to 3 years before diagnosis. Additionally, AI outperforms previous tools in predicting cancer survival rates. In conclusion, AI and ML exhibit promise for diagnosing, predicting, and potentially treating various medical conditions, including cancer. AI-based algorithms provide efficient and uncomplicated solutions to aid medical professionals in their decision-making processes, offering a cost-effective and practical approach to improve healthcare outcomes.

AUTHOR CONTRIBUTIONS
Seyed Mohammad Ayyoubzadeh and Mahnaz Ahmadi analyzed and interpreted the data, Seyed Mohammad Ayyoubzadeh performed the analysis of the data, and all authors were contributors to writing the manuscript. All authors read and approved the final manuscript. Seyed Mohammad Ayyoubzadeh and Mahnaz Ahmadi had full access to all of the data in this study and took complete responsibility for the integrity and accuracy of the data analysis.

ACKNOWLEDGMENTS
This study has been funded and supported by Tehran University of Medical Sciences (TUMS); Grant no. 1401-4-102-63824.

FIGURE 6 Receiver operating characteristic curve diagram of random forest model.

Table 2. The RF model was identified as the best predictive model